Abstract

We show that for every large enough integer $N$, there exists an $N$-point subset of $L_1$ such
that for every $D > 1$, embedding it into $\ell_1^d$ with distortion $D$ requires dimension $d$ at least
$N^{\Omega(1/D^2)}$, and that for every $\varepsilon > 0$ and large enough integer $N$, there exists an $N$-point
subset of $L_1$ such that embedding it into $\ell_1^d$ with distortion $1 + \varepsilon$ requires dimension $d$ at
least $N^{1 - O(1/\log(1/\varepsilon))}$. These results were previously proven by Brinkman and Charikar [JACM,
2005] and by Andoni, Charikar, Neiman, and Nguyen [FOCS 2011]. We provide an alternative
and arguably more intuitive proof based on an entropy argument.

1 Introduction 

We prove the following theorem. 

Theorem 1.1. For every large enough integer $N$, there exists an $N$-point subset of $L_1$ such that
for every $D > 1$, embedding it into $\ell_1^d$ with distortion $D$ requires dimension $d$ at least
$N^{\Omega(1/D^2)}$. Moreover, for every $\varepsilon > 0$ and large enough integer $N$, there exists an $N$-point
subset of $L_1$ such that embedding it into $\ell_1^d$ with distortion $1 + \varepsilon$ requires dimension $d$ at
least $N^{1 - O(1/\log(1/\varepsilon))}$.

Both parts of Theorem 1.1 were previously known. The first part (embedding with large distortion)
was first shown by Brinkman and Charikar [BC05], and later with a simpler proof by Lee
and Naor [LN04]. The second part (embedding with low distortion) was recently shown by Andoni,
Charikar, Neiman, and Nguyen [ACNN11]. Our proof is based on an entropy argument,
and is arguably more intuitive.

The set of points we use is identical to the one used by Andoni et al. [ACNN11]. For completeness,
we briefly describe it here (see also Figure 1 for an illustration). For integers $k \ge 2$, $n \ge 1$, we
define the so-called "recursive cycle" graph $G_{k,n}$, and associate with each vertex a label in
$\{0,1\}^{k^n}$. The set of all labels will be our point set $P_{k,n}$ in $\ell_1$. First, for $k \ge 2$, let $G_{k,1}$ be
the cycle of length $2k$, with two distinguished antipodal vertices (i.e., of distance $k$), call them
"left" and "right". For $0 \le i \le k$, the $i$th vertex on the top path from the left to the right vertex
is labeled with the vector $(0,\ldots,0,1,\ldots,1)$ with $k - i$ zeros and $i$ ones, and the $i$th vertex on the
bottom path is associated with the vector $(1,\ldots,1,0,\ldots,0)$ with $i$ ones and $k - i$ zeros. Notice
that the $\ell_1$ distance between the labels of any two adjacent vertices is 1, whereas that between
the labels of any two antipodal vertices is $k$.

Figure 1: $G_{3,2}$ with our labeling and orientation of the edges and the labels on vertices in $\{0,1\}^9$.
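
The two properties of the base cycle $G_{k,1}$ noted above can be checked mechanically. The following is a minimal sketch, not part of the original construction; the function names `base_cycle_labels`, `l1_distance`, and `check_base_cycle` are our own, and labels are represented as tuples of 0/1 entries.

```python
def base_cycle_labels(k):
    """Labels on the top and bottom paths of the 2k-cycle, listed from left to right."""
    top = [(0,) * (k - i) + (1,) * i for i in range(k + 1)]      # k - i zeros, then i ones
    bottom = [(1,) * i + (0,) * (k - i) for i in range(k + 1)]   # i ones, then k - i zeros
    return top, bottom


def l1_distance(u, v):
    """The l_1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def check_base_cycle(k):
    """Return True if adjacent labels are at distance 1 and antipodal labels at distance k."""
    top, bottom = base_cycle_labels(k)
    # Both paths share their endpoints: "left" (all zeros) and "right" (all ones).
    assert top[0] == bottom[0] and top[k] == bottom[k]
    # Walk once around the cycle: along the top path, then back along the bottom path.
    cycle = top[:-1] + bottom[:0:-1]
    assert len(cycle) == 2 * k and len(set(cycle)) == 2 * k
    for j in range(2 * k):
        if l1_distance(cycle[j], cycle[(j + 1) % (2 * k)]) != 1:
            return False
        if l1_distance(cycle[j], cycle[(j + k) % (2 * k)]) != k:
            return False
    return True
```

Calling `check_base_cycle(k)` for small values of $k$ confirms both distance properties on the $2k$ labels.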

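For $n \ge 2$, define $G_{k,n}$ as the graph obtained from $G_{k,n-1}$ by replacing each edge with a copy
of $G_{k,1}$ and identifying the distinguished vertices with the original endpoints of the edge. Since
$G_{k,n-1}$ has $(2k)^{n-1}$ edges and each replacement adds $2k - 2$ new vertices, the number of vertices
in $G_{k,n}$ is easily seen to be

$$N_{k,n} = 2 + (2k-2)\,\frac{(2k)^n - 1}{2k-1} \le (2k)^n.$$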

For the labels, we first take the labels in $G_{k,n-1}$ and duplicate each coordinate $k$ times. This
defines the labels for those vertices coming from $G_{k,n-1}$. For the newly added vertices on each
cycle that replaced an edge of $G_{k,n-1}$, we replace the $k$ coordinates on which the two distinguished
nodes of that cycle differ with the same labeling of $G_{k,1}$ described earlier. Notice the following
two properties: the $\ell_1$ distance between the labels of any two adjacent vertices is 1, and for
$1 \le \ell \le n$, the distance between any two antipodal vertices in level $\ell$ is $k^{n-\ell+1}$. We remark that
these two properties are also satisfied by the shortest path metric on $G_{k,n}$, but since that metric
is not in $\ell_1$, it is not good enough for the purpose of proving dimension reduction in $\ell_1$.
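
The recursive labeling can be sketched in a few lines. This is an illustrative sketch under our own conventions, not code from the paper: edges are stored as ordered pairs of labels oriented from left to right, and the names `recursive_cycle_edges` and `vertex_count` are hypothetical. The closed form used in `vertex_count` is $2 + (2k-2)((2k)^n - 1)/(2k-1)$, which follows from the fact that each of the $(2k)^{n-1}$ edges of $G_{k,n-1}$ contributes $2k - 2$ new vertices.

```python
def l1_distance(u, v):
    """The l_1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def base_paths(k):
    """Top and bottom paths of the base cycle, each listed from left to right."""
    top = [(0,) * (k - i) + (1,) * i for i in range(k + 1)]
    bottom = [(1,) * i + (0,) * (k - i) for i in range(k + 1)]
    return top, bottom


def recursive_cycle_edges(k, n):
    """Edges of the recursive cycle graph as (left label, right label) pairs."""
    paths = base_paths(k)
    edges = [(path[i], path[i + 1]) for path in paths for i in range(k)]
    for _level in range(n - 1):
        refined = []
        for u, v in edges:
            # Adjacent labels differ in exactly one coordinate j, with u[j] = 0 and v[j] = 1.
            j = next(i for i in range(len(u)) if u[i] != v[i])
            assert u[j] == 0 and v[j] == 1
            # Duplicate each coordinate k times; coordinate j becomes a block of k coordinates.
            stretched = tuple(c for c in u for _copy in range(k))
            for path in paths:
                # Overwrite the block with the base-cycle labels along this path.
                points = [stretched[:j * k] + block + stretched[(j + 1) * k:] for block in path]
                refined.extend(zip(points, points[1:]))
        edges = refined
    return edges


def vertex_count(k, n):
    """Closed form for the number of vertices (the division is exact)."""
    return 2 + (2 * k - 2) * ((2 * k) ** n - 1) // (2 * k - 1)
```

For instance, collecting the endpoints of `recursive_cycle_edges(3, 2)` into a set gives the point set of Figure 1, with labels of length 9, and every edge joins two labels at $\ell_1$ distance 1.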

Finally, we label the edges of $G_{k,1}$ by elements of $[2k]$ starting from the left vertex and going
along the cycle, and extend this to a labeling of $G_{k,n}$ by elements of $[2k]^n$ in a recursive way, with
the coordinates labeling the location of the edge from the top layer to the bottom layer (see Figure 1).

The idea of the proof is the following. Given a low-distortion embedding of $P_{k,n}$ into $\ell_1^d$, we
naturally obtain a mapping that maps each edge of the graph $G_{k,n}$ to a $d$-dimensional vector (namely,
the difference between the two embedded endpoints) whose $\ell_1$ norm is close to 1. Assume for
simplicity that this norm is exactly 1; assume moreover that the vector has non-negative coordinates.
(In the proof we will show how to reduce the general case to this case.) So we can equivalently
view this mapping as an encoding from $[2k]^n$ to probability distributions over $[d]$. Using the
second property mentioned above, one can obtain the following crucial property of the encoding:
For any $\ell \in [n]$ and any $x_1,\ldots,x_{\ell-1} \in [2k]$, if we are given $x_1,\ldots,x_{\ell-1}$ together with the
encoding of $(x_1,\ldots,x_n) \in [2k]^n$, where $x_\ell,\ldots,x_n$ are chosen uniformly, then we have a good
probability to guess $x_\ell \bmod k$ (perfect probability in case of no distortion). A basic information
theoretic argument now provides a lower bound on $d$ of any such encoding. For instance, in the case
there is no distortion, the encoding allows us to predict $x_\ell \bmod k$ as above with certainty, and the
information theoretic argument gives the tight bound $d \ge k^n$. We note that this simple yet powerful
information theoretic argument appears in various different contexts, such as that of quantum random
access codes [Nay99].



2 Preliminaries 

All logarithms are base 2. We use $[k]$ to denote the set $\{1,\ldots,k\}$. We now list a few basic
definitions and facts from information theory. Although not really needed for our proof, the
interested reader can find an introduction to the area in [CT06]. We let
$H(\delta) := -\delta \log \delta - (1-\delta)\log(1-\delta)$ denote the binary entropy function. For a random variable
$X$ on a domain $[d]$ obtaining each value $i \in [d]$ with probability $p_i$, the entropy of $X$ is given by
$H(X) := -\sum_{i=1}^{d} p_i \log p_i$, and is always at most $\log d$. For two random variables $X, Y$, the
conditional entropy $H(X \mid Y)$ is the expectation of $H(X \mid Y = y)$ over $y$ chosen according to $Y$;
this can be seen to equal $H(XY) - H(Y)$. Finally, the mutual information $I(X : Y)$ is defined as
$H(X) + H(Y) - H(XY) = H(X) - H(X \mid Y)$, and the conditional mutual information $I(X : Y \mid Z)$ is
the expectation of $I(X : Y \mid Z = z)$ over $z$ chosen according to $Z$, or equivalently,
$H(X \mid Z) + H(Y \mid Z) - H(XY \mid Z)$. The data processing inequality says that applying a function
cannot increase mutual information, $I(f(X) : Y) \le I(X : Y)$.

The following claim (which is essentially what is known as Fano's inequality) shows that if one 
random variable can be used to predict another random variable, then their mutual information 
cannot be too small. 

Claim 2.1. Assume $X$ is a random variable uniformly distributed over $[k]$. Let $Y$ be another random
variable, and assume that there exists some function $f$ with range $[k]$ such that $f(Y) = X$ with
probability at least $p > 1/2$. Then $I(X : Y) \ge \log k - (1-p)\log(k-1) - H(p)$.

Proof. By the data processing inequality,

$$I(X : Y) \ge I(X : f(Y)) = H(X) - H(X \mid f(Y)) = \log k - H(X \mid f(Y)),$$

so it suffices to bound $H(X \mid f(Y))$ from above. Since conditioning cannot increase entropy,

$$\begin{aligned}
H(X \mid f(Y)) &= H(1_{X=f(Y)}, X \mid f(Y)) \\
&= H(1_{X=f(Y)} \mid f(Y)) + H(X \mid 1_{X=f(Y)}, f(Y)) \\
&\le H(1_{X=f(Y)}) + H(X \mid 1_{X=f(Y)}, f(Y)) \\
&\le H(p) + (1-p)\log(k-1).
\end{aligned}$$

□

Figure 2: The condition in Eq. (1) for r = 1, k = 3.
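
As a sanity check on Claim 2.1, one can compute the mutual information of a small explicit channel and compare it with the bound. The sketch below is our own illustration and not part of the proof; the helper names and the two example channels are assumptions chosen for convenience. For the symmetric channel, where $Y = X$ with probability $p$ and $Y$ is otherwise uniform over the remaining $k - 1$ values, the bound holds with equality, which shows that the claim cannot be improved in general.

```python
from math import log2


def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def mutual_information(joint):
    """I(X : Y) in bits for a joint distribution given as a matrix joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * log2(pxy / (px[x] * py[y]))
        for x, row in enumerate(joint)
        for y, pxy in enumerate(row)
        if pxy > 0
    )


def fano_lower_bound(k, p):
    """The right-hand side of Claim 2.1: log k - (1 - p) log(k - 1) - H(p)."""
    return log2(k) - (1 - p) * log2(k - 1) - binary_entropy(p)


def symmetric_channel(k, p):
    """X uniform on k values; Y = X with probability p, else uniform over the other values."""
    off = (1 - p) / (k - 1)
    return [[(p if x == y else off) / k for y in range(k)] for x in range(k)]
```

Here the predictor $f$ is the identity, so `mutual_information(symmetric_channel(k, p))` can be compared directly with `fano_lower_bound(k, p)`.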



3 Proof 

Our main technical theorem is the following. 

Theorem 3.1. For any $k \ge 2$, $n \ge 1$ the following holds. Assume $f : [2k]^n \to \mathbb{R}^d$ satisfies that for
all $x_1,\ldots,x_n \in [2k]$, $\|f(x_1,\ldots,x_n)\|_1 \le 1$ and, moreover, that for some $\varepsilon < 1/(k-1)$, and for
all $\ell \in [n]$, $x_1,\ldots,x_{\ell-1} \in [2k]$, and $r \in [k-1]$,

$$\frac{1}{2k} \Bigl\| \sum_{b=1}^{r} \bigl( f(x_1,\ldots,x_{\ell-1},b) + f(x_1,\ldots,x_{\ell-1},b+k) \bigr) - \sum_{b=r+1}^{k} \bigl( f(x_1,\ldots,x_{\ell-1},b) + f(x_1,\ldots,x_{\ell-1},b+k) \bigr) \Bigr\|_1 \ge 1 - \varepsilon, \tag{1}$$

where $f(x_1,\ldots,x_\ell)$ denotes the average of $f(x_1,\ldots,x_n)$ over $x_{\ell+1},\ldots,x_n$ chosen uniformly in
$[2k]$. Then

$$d \ge 2^{(\log k - \delta \log(k-1) - H(\delta))\,n - 1} - \frac{1}{2}, \tag{2}$$

where $\delta := (k-1)\varepsilon/2 < 1/2$.



Before proving the theorem, let us explain how it implies Theorem 1.1. Consider any embedding
$F$ of $P_{k,n}$ into $\ell_1^d$ with distortion at most $1/(1-\varepsilon)$ for some $\varepsilon < 1/(k-1)$. By scaling $F$, we
can assume that it is 1-Lipschitz (i.e., it does not expand any distance) and that distances are not
contracted by more than $1 - \varepsilon$. Let $f$ be the function that maps each $x \in [2k]^n$ to $F(u) - F(v)$,
where $u$ is the label of the right endpoint of the edge labeled by $x$ and $v$ is the label of its left
endpoint. Since $F$ is 1-Lipschitz, $\|f(x)\|_1 \le 1$ for all $x \in [2k]^n$. Moreover, it is not difficult to see
that $f$ satisfies Eq. (1) (see Figure 2). Hence, Theorem 3.1 implies that the bound in Eq. (2) holds.

For the first part of Theorem 1.1 we fix $k = 2$. We obtain that for any $D > 1$, any distortion-$D$
embedding of $G_{2,n}$ (so $\varepsilon = 1 - 1/D$ and $\delta = 1/2 - 1/(2D)$) must have dimension at least

$$2^{(1 - H(1/2 - 1/(2D)))\,n - 1} - \frac{1}{2} = 2^{\Omega(n/D^2)} = N_{2,n}^{\Omega(1/D^2)}.$$
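
The step from $1 - H(1/2 - 1/(2D))$ to $\Omega(1/D^2)$ can be made concrete numerically. The sketch below is our own illustration, reading the bound of Eq. (2) with $k = 2$ as $2^{(1 - H(1/2 - 1/(2D)))n - 1} - 1/2$; the function names are hypothetical. It relies on the standard Taylor-series fact, not stated in the text, that $1 - H(1/2 - x) \ge 2x^2/\ln 2$ for $0 \le x \le 1/2$, which with $x = 1/(2D)$ gives a per-level exponent of at least $1/(2 \ln 2 \cdot D^2)$.

```python
from math import log, log2


def binary_entropy(x):
    """H(x) in bits, with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)


def exponent_rate(D):
    """Per-level exponent 1 - H(1/2 - 1/(2D)) for k = 2 and distortion D >= 1."""
    return 1.0 - binary_entropy(0.5 - 1.0 / (2.0 * D))


def dimension_lower_bound(D, n):
    """The k = 2 dimension bound as read above (an assumption about Eq. (2))."""
    return 2.0 ** (exponent_rate(D) * n - 1) - 0.5


def quadratic_constant():
    """The constant 1 / (2 ln 2) in the bound exponent_rate(D) >= constant / D^2."""
    return 1.0 / (2.0 * log(2.0))
```

Evaluating `exponent_rate(D) * D * D` for growing `D` shows the product staying above `quadratic_constant()`, so the exponent decays no faster than order $1/D^2$.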



For the second part of Theorem 1.1, choosing $k \approx 1/(\varepsilon \log(1/\varepsilon))$ and noting that $\delta \log k = O(1)$,
we obtain that the dimension must be at least

$$(2k)^n \cdot 2^{-O(\delta \log k)\,n - 1} - \frac{1}{2} = N_{k,n}^{1 - O(1/\log(1/\varepsilon))}.$$



Proof of Theorem 3.1. We start by considering the case that for all $x_1,\ldots,x_n \in [2k]$, $f(x_1,\ldots,x_n)$
has non-negative coordinates and $\ell_1$-norm 1. We will later see how this implies the general case.
Making this assumption allows us to think of $f(x_1,\ldots,x_n)$ as a probability distribution over $[d]$.
Let $X = (X_1,\ldots,X_n)$ and $M$ be two random variables where $X$ is uniformly distributed over $[2k]^n$
and $M$ is distributed over $[d]$ according to $f(X)$. Using the chain rule for mutual information we
obtain

$$\log d \ge H(M) \ge I(X : M) = I(X_1 : M) + I(X_2 : M \mid X_1) + \cdots + I(X_n : M \mid X_1,\ldots,X_{n-1}).$$

The following lemma implies that for any $\ell \in [n]$,

$$I(X_\ell : M \mid X_1,\ldots,X_{\ell-1}) \ge \log k - \delta \log(k-1) - H(\delta)$$

(this is true even conditioned on any fixed value of $X_1,\ldots,X_{\ell-1}$, and not just on average) and
therefore

$$d \ge 2^{(\log k - \delta \log(k-1) - H(\delta))\,n}.$$

Lemma 3.2. Let $A$ and $B$ be two random variables such that $A$ is uniformly distributed over $[2k]$ and
for any $a \in [2k]$, conditioned on $A = a$, $B$ is distributed according to some probability distribution
$P_a$ on $[d]$. Assume that for all $r \in [k-1]$,

$$\frac{1}{2k} \Bigl\| \sum_{a=1}^{r} (P_a + P_{a+k}) - \sum_{a=r+1}^{k} (P_a + P_{a+k}) \Bigr\|_1 \ge 1 - \varepsilon.$$

Then $I(A : B) \ge \log k - \delta \log(k-1) - H(\delta)$.



Proof. Let $A' = ((A - 1) \bmod k) + 1$, and notice that $A'$ is uniformly distributed on $[k]$. By the
data processing inequality $I(A : B) \ge I(A' : B)$. For any $a \in [k]$, let $Q_a := (P_a + P_{a+k})/2$ be the
distribution of $B$ conditioned on $A' = a$. Our assumption says that for all $r \in [k-1]$,

$$\frac{1}{k} \Bigl\| \sum_{a=1}^{r} Q_a - \sum_{a=r+1}^{k} Q_a \Bigr\|_1 \ge 1 - \varepsilon.$$

We need the following easy claim.

Claim 3.3. For any $p_1,\ldots,p_k \ge 0$,

$$\sum_{i=1}^{k} p_i - \max\{p_1,\ldots,p_k\} \le \frac{1}{2} \sum_{r=1}^{k-1} \Bigl( \sum_{i=1}^{k} p_i - \Bigl| \sum_{i=1}^{r} p_i - \sum_{i=r+1}^{k} p_i \Bigr| \Bigr).$$

Proof. Let $r^* \in \{0,\ldots,k-1\}$ be the largest such that the expression inside the absolute value is
negative. Then the sum of the absolute values at $r = r^*$ and $r = r^* + 1$ is exactly $2p_{r^*+1}$. The claim
follows. □
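
The inequality of Claim 3.3, in the form $\sum_i p_i - \max_i p_i \le \frac{1}{2}\sum_{r=1}^{k-1}\bigl(\sum_i p_i - \bigl|\sum_{i \le r} p_i - \sum_{i > r} p_i\bigr|\bigr)$, is easy to test numerically. The following sketch is our own illustration and not part of the proof; the name `claim_sides` is hypothetical.

```python
def claim_sides(p):
    """Return (left-hand side, right-hand side) of the inequality for non-negative p."""
    k = len(p)
    total = sum(p)
    lhs = total - max(p)
    # For each split point r, compare the mass of the first r entries with the rest.
    rhs = 0.5 * sum(total - abs(sum(p[:r]) - sum(p[r:])) for r in range(1, k))
    return lhs, rhs
```

Two cases where equality holds: a point mass such as `(0, 1, 0)` gives both sides equal to 0, and the two-point vector `(1, 1)` gives both sides equal to 1.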

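By applying the inequality to each of the $d$ coordinates of the probability distributions $Q_a$, and
summing the results, we obtain

$$1 - \frac{1}{k} \bigl\| \max\{Q_1,\ldots,Q_k\} \bigr\|_1 \le \frac{1}{2} \sum_{r=1}^{k-1} \Bigl( 1 - \frac{1}{k} \Bigl\| \sum_{a=1}^{r} Q_a - \sum_{a=r+1}^{k} Q_a \Bigr\|_1 \Bigr),$$

where the maximum is taken coordinate-wise, and hence

$$\frac{1}{k} \bigl\| \max\{Q_1,\ldots,Q_k\} \bigr\|_1 \ge 1 - (k-1)\varepsilon/2 = 1 - \delta.$$

Consider the function that maps each $j \in [d]$ to the $a \in [k]$ that maximizes $\Pr[Q_a = j]$. This
function correctly predicts $A'$ from $B$ with probability $\frac{1}{k} \| \max\{Q_1,\ldots,Q_k\} \|_1$. The lemma now
follows from Claim 2.1. □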

We now show how to derive a similar bound for any $f$ as in the statement of the theorem. Let
$f : [2k]^n \to \mathbb{R}^d$ be such that for all $x \in [2k]^n$, $f(x)$ has $\ell_1$ norm at most 1. Define
$g : [2k]^n \to \mathbb{R}^{2d+1}$ by the concatenation

$$g(x) := \max\{f(x), 0\} \circ \max\{-f(x), 0\} \circ (1 - \|f(x)\|_1).$$

Obviously for all $x$, $g(x)$ is non-negative and has $\ell_1$ norm 1. Moreover, the linear operator that
maps any $y \in \mathbb{R}^{2d+1}$ to the vector $(y_j - y_{j+d})_{j=1}^{d} \in \mathbb{R}^d$ cannot increase the $\ell_1$ norm and maps
$g(x)$ to $f(x)$ for all $x$. Therefore Eq. (1) holds for $g$, and the theorem follows. □
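
The reduction from a general $f$ to a non-negative $g$ of norm 1 can be illustrated on a single vector. The sketch below is our own; the names `split_nonnegative` and `fold_back` are hypothetical, vectors are plain Python lists, and indices are 0-based rather than 1-based as in the text.

```python
def split_nonnegative(f_x):
    """Concatenate the positive part, the negative part, and the slack 1 - ||f_x||_1."""
    positive = [max(c, 0.0) for c in f_x]
    negative = [max(-c, 0.0) for c in f_x]
    slack = 1.0 - sum(abs(c) for c in f_x)   # non-negative when ||f_x||_1 <= 1
    return positive + negative + [slack]


def fold_back(y):
    """The linear map y -> (y_j - y_{j+d})_j from dimension 2d + 1 down to d."""
    d = (len(y) - 1) // 2
    return [y[j] - y[j + d] for j in range(d)]
```

For example, applying `split_nonnegative` to `[0.5, -0.25, 0.0]` yields a non-negative vector of length 7 whose entries sum to 1, and `fold_back` maps it back to the original vector.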